The bank data is anonymized; each row contains 20 feature values, half of them categorical and the rest numeric.
In the following we will explore the data, split it into train and test sets, prepare it for modelling, train a model and predict the target value for the test set. We will then identify the 5 most predictive features and use them in a new model to compare the benefits.
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV
warnings.filterwarnings('ignore')
# run external scripts
%run ./plot_funcs.py
Load and clean the data.
%%time
df = pd.read_csv('data_set.csv', header=0)
cols = df.columns.to_list()[0].split(sep=';')
df = df[df.columns.to_list()[0]].str.split(';', expand=True)
for i, col in enumerate(cols):
    cols[i] = col.replace('"', '')
df.columns = cols
# set feature types with memory optimisation
data_type = {'age': np.int8, 'duration': np.int16, 'campaign': np.int8,
'pdays': np.int16, 'previous': np.int8, 'emp.var.rate':
np.float32, 'cons.price.idx': np.float32, 'cons.conf.idx':
np.float32, 'euribor3m': np.float32, 'nr.employed': np.float32,
'y': np.int8}
df['y'].replace(to_replace={'"yes"': 1, '"no"': 0}, inplace=True)
df = df.astype(dtype=data_type, errors='raise')
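As an aside, the manual single-column split above can usually be avoided: pandas parses ';'-delimited, quoted files directly when given `sep=';'`. A minimal, self-contained sketch using a tiny illustrative CSV (not the real data_set.csv):

```python
import io

import pandas as pd

# Tiny illustrative CSV mimicking the layout of data_set.csv (quoted,
# ';'-separated); the real file is assumed, not reproduced, here.
raw = '"age";"job";"y"\n30;"admin.";"no"\n45;"services";"yes"\n'

# sep=';' with the default quotechar '"' strips the quotes during parsing,
# so no manual column split or quote removal is needed.
df = pd.read_csv(io.StringIO(raw), sep=';')
df['y'] = df['y'].map({'yes': 1, 'no': 0})
print(df.columns.to_list())  # ['age', 'job', 'y']
```

With the real file, `pd.read_csv('data_set.csv', sep=';')` would replace the split-and-strip steps above in one call.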
Split train/test sample and categorical/numerical variables.
%%time
# replace the 999 sentinel ('not previously contacted') with -1
df['pdays'] = df['pdays'].where(df['pdays']!=999, -1)
# split train and test set 80-20%
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True)
categ_feat = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week',
'poutcome']
numer_feat = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx',
'cons.conf.idx', 'euribor3m', 'nr.employed']
target = 'y'
train_df.shape, test_df.shape
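A note on the `.where` call used for pdays above, since its semantics are easy to invert: `Series.where(cond, other)` keeps the values where the condition holds and substitutes `other` elsewhere. A tiny sketch:

```python
import pandas as pd

# where() keeps entries satisfying the condition and replaces the rest,
# so the 999 sentinel ends up as -1 while genuine pdays values survive.
s = pd.Series([3, 999, 12, 999])
cleaned = s.where(s != 999, -1)
print(cleaned.tolist())  # [3, -1, 12, -1]
```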
The train and test data have 32,950 and 8,238 entries respectively, each with 21 columns.
Let's glimpse the train and test datasets.
train_df.head()
test_df.head()
Datasets contain:
Let's check if there is any missing data. We will also check the data types (correctly assigned after exploring each feature's range of values and type).
%%time
missing_data(train_df)
Here we check the test dataset.
%%time
missing_data(test_df)
There are no missing data in the train and test datasets. Let's check the numerical values in both.
%%time
train_df.describe()
%%time
test_df.describe()
We can make a few observations here:
The number of values in the train and test set is the same. Let's plot scatter plots of a few of the features for the train and test set.
We will show just 25% of the train data and all of the test data. The x axis shows the train values and the y axis the test values.
features = numer_feat
plot_feature_scatter(train_df[:len(test_df)], test_df, features)
Let's check the distribution of target value in train dataset.
sns.countplot(train_df[target], palette='Set3')
print("There are {0:.3f}% target values with 1".format(100 * train_df[target].value_counts()[1]/train_df.shape[0]))
Let's check the distribution of target value in test dataset.
sns.countplot(test_df[target], palette='Set3')
print("There are {0:.2f}% target values with 1".format(100 * test_df[target].value_counts()[1]/test_df.shape[0]))
The data is imbalanced with respect to the target value. For the sake of speeding up the analysis, we will assume that we did a good job splitting the train and test samples and that both are representative of the overall sample.
With this in mind, we will dig deeper into the investigation of the categorical features on the train set.
train_df['job'].unique()
categ_feat_dfs = []
categ_feat_test_dfs = []
job_cat = pd.get_dummies(train_df['job'], prefix='job', drop_first=False)
job_cat.columns = [col.replace('"', '') for col in job_cat.columns]
job_cat.drop(['job_unknown'], axis=1, inplace=True)
job_cat_test = pd.get_dummies(test_df['job'], prefix='job', drop_first=False)
job_cat_test.columns = [col.replace('"', '') for col in job_cat_test.columns]
job_cat_test.drop(['job_unknown'], axis=1, inplace=True)
categ_feat_test_dfs.append(job_cat_test)
categ_feat_dfs.append(job_cat)
plot_feature_categ(train_df, 'job', target)
train_df['marital'].unique()
marital_cat = pd.get_dummies(train_df['marital'], prefix='marital', drop_first=False)
marital_cat.columns = [col.replace('"', '') for col in marital_cat.columns]
marital_cat.drop(['marital_unknown'], axis=1, inplace=True)
marital_cat_test = pd.get_dummies(test_df['marital'], prefix='marital', drop_first=False)
marital_cat_test.columns = [col.replace('"', '') for col in marital_cat_test.columns]
marital_cat_test.drop(['marital_unknown'], axis=1, inplace=True)
categ_feat_test_dfs.append(marital_cat_test)
categ_feat_dfs.append(marital_cat)
plot_feature_categ(train_df, 'marital', target)
train_df['education'].unique()
education_cat = pd.get_dummies(train_df['education'], prefix='education', drop_first=False)
education_cat.columns = [col.replace('"', '') for col in education_cat.columns]
education_cat.drop(['education_unknown'], axis=1, inplace=True)
education_cat_test = pd.get_dummies(test_df['education'], prefix='education', drop_first=False)
education_cat_test.columns = [col.replace('"', '') for col in education_cat_test.columns]
education_cat_test.drop(['education_unknown'], axis=1, inplace=True)
categ_feat_test_dfs.append(education_cat_test)
categ_feat_dfs.append(education_cat)
plot_feature_categ(train_df, 'education', target)
train_df['default'].unique()
default_cat = pd.get_dummies(train_df['default'], prefix='default', drop_first=False)
default_cat.columns = [col.replace('"', '') for col in default_cat.columns]
default_cat.drop(['default_yes'], axis=1, inplace=True, errors='ignore')
default_cat_test = pd.get_dummies(test_df['default'], prefix='default', drop_first=False)
default_cat_test.columns = [col.replace('"', '') for col in default_cat_test.columns]
default_cat_test.drop(['default_yes'], axis=1, inplace=True, errors='ignore')
categ_feat_test_dfs.append(default_cat_test)
categ_feat_dfs.append(default_cat)
plot_feature_categ(train_df, 'default', target)
train_df['housing'].unique()
housing_cat = pd.get_dummies(train_df['housing'], prefix='housing', drop_first=False)
housing_cat.columns = [col.replace('"', '') for col in housing_cat.columns]
housing_cat.drop(['housing_unknown'], axis=1, inplace=True)
housing_cat_test = pd.get_dummies(test_df['housing'], prefix='housing', drop_first=False)
housing_cat_test.columns = [col.replace('"', '') for col in housing_cat_test.columns]
housing_cat_test.drop(['housing_unknown'], axis=1, inplace=True)
categ_feat_test_dfs.append(housing_cat_test)
categ_feat_dfs.append(housing_cat)
plot_feature_categ(train_df, 'housing', target)
train_df['loan'].unique()
loan_cat = pd.get_dummies(train_df['loan'], prefix='loan', drop_first=False)
loan_cat.columns = [col.replace('"', '') for col in loan_cat.columns]
loan_cat.drop(['loan_unknown'], axis=1, inplace=True)
loan_cat_test = pd.get_dummies(test_df['loan'], prefix='loan', drop_first=False)
loan_cat_test.columns = [col.replace('"', '') for col in loan_cat_test.columns]
loan_cat_test.drop(['loan_unknown'], axis=1, inplace=True)
categ_feat_test_dfs.append(loan_cat_test)
categ_feat_dfs.append(loan_cat)
plot_feature_categ(train_df, 'loan', target)
train_df['contact'].unique()
contact_cat = pd.get_dummies(train_df['contact'], prefix='contact', drop_first=False)
contact_cat.columns = [col.replace('"', '') for col in contact_cat.columns]
contact_cat.drop(['contact_telephone'], axis=1, inplace=True)
contact_cat_test = pd.get_dummies(test_df['contact'], prefix='contact', drop_first=False)
contact_cat_test.columns = [col.replace('"', '') for col in contact_cat_test.columns]
contact_cat_test.drop(['contact_telephone'], axis=1, inplace=True)
categ_feat_test_dfs.append(contact_cat_test)
categ_feat_dfs.append(contact_cat)
plot_feature_categ(train_df, 'contact', target)
train_df['month'].unique()
month_cat = pd.get_dummies(train_df['month'], prefix='month', drop_first=False)
month_cat.columns = [col.replace('"', '') for col in month_cat.columns]
month_cat.drop(['month_mar'], axis=1, inplace=True)
month_cat_test = pd.get_dummies(test_df['month'], prefix='month', drop_first=False)
month_cat_test.columns = [col.replace('"', '') for col in month_cat_test.columns]
month_cat_test.drop(['month_mar'], axis=1, inplace=True)
categ_feat_test_dfs.append(month_cat_test)
categ_feat_dfs.append(month_cat)
order = ['"mar"', '"apr"', '"may"', '"jun"', '"jul"', '"aug"', '"sep"', '"oct"', '"nov"', '"dec"']
plot_feature_categ(train_df, 'month', target, order=order)
train_df['day_of_week'].unique()
day_cat = pd.get_dummies(train_df['day_of_week'], prefix='day', drop_first=False)
day_cat.columns = [col.replace('"', '') for col in day_cat.columns]
day_cat.drop(['day_fri'], axis=1, inplace=True)
day_cat_test = pd.get_dummies(test_df['day_of_week'], prefix='day', drop_first=False)
day_cat_test.columns = [col.replace('"', '') for col in day_cat_test.columns]
day_cat_test.drop(['day_fri'], axis=1, inplace=True)
categ_feat_test_dfs.append(day_cat_test)
categ_feat_dfs.append(day_cat)
order = ['"mon"', '"tue"', '"wed"', '"thu"', '"fri"']
plot_feature_categ(train_df, 'day_of_week', target, order=order)
train_df['poutcome'].unique()
poutcome_cat = pd.get_dummies(train_df['poutcome'], prefix='poutcome', drop_first=False)
poutcome_cat.columns = [col.replace('"', '') for col in poutcome_cat.columns]
poutcome_cat.drop(['poutcome_nonexistent'], axis=1, inplace=True)
poutcome_cat_test = pd.get_dummies(test_df['poutcome'], prefix='poutcome', drop_first=False)
poutcome_cat_test.columns = [col.replace('"', '') for col in poutcome_cat_test.columns]
poutcome_cat_test.drop(['poutcome_nonexistent'], axis=1, inplace=True)
categ_feat_test_dfs.append(poutcome_cat_test)
categ_feat_dfs.append(poutcome_cat)
plot_feature_categ(train_df, 'poutcome', target)
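The nine nearly identical encoding blocks above could be collapsed into one helper; `encode_categ` below is a hypothetical name, a sketch of the shared pattern rather than the notebook's actual code:

```python
import pandas as pd

def encode_categ(df, feature, drop_level=None, prefix=None):
    """One-hot encode `feature`, strip stray quotes from the dummy column
    names, and drop one reference level when present (hypothetical helper)."""
    dummies = pd.get_dummies(df[feature], prefix=prefix or feature)
    dummies.columns = [c.replace('"', '') for c in dummies.columns]
    if drop_level is not None:
        dummies = dummies.drop(columns=[drop_level], errors='ignore')
    return dummies

# Illustrative frame with the same quoted string values as the raw data.
demo = pd.DataFrame({'job': ['"admin."', '"services"', '"unknown"']})
enc = encode_categ(demo, 'job', drop_level='job_unknown')
print(enc.columns.to_list())  # ['job_admin.', 'job_services']
```

With such a helper, each feature becomes a one-line call per dataset, e.g. `encode_categ(train_df, 'housing', drop_level='housing_unknown')`.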
Let's now show density plots of the numerical features in the train dataset.
We use different colors for the distributions of rows with target value 0 and 1.
The 10 numerical features are displayed in the following cell.
t0 = train_df.loc[train_df[target] == 0]
t1 = train_df.loc[train_df[target] == 1]
plot_feature_distribution(t0, t1, '0', '1', numer_feat)
We can observe that a considerable number of features show significantly different distributions for the two target values,
for example duration, pdays, previous, emp.var.rate, cons.price.idx and nr.employed.
Also, some features, like emp.var.rate, cons.price.idx, cons.conf.idx and euribor3m, show distributions that resemble multimodal distributions. This is also an early indicator that a tree-based model can perform well.
We will take this into consideration later when selecting the features for our prediction model.
Let's now look at the distribution of the same features in the train and test datasets in parallel.
The same 10 features are displayed in the following cell.
plot_feature_distribution(train_df, test_df, 'train', 'test', numer_feat)
The train and test sets do not seem to be perfectly balanced with respect to the distributions of the numeric variables, something that could affect the performance of the final model.
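The visual impression above could be quantified; one common choice is a two-sample Kolmogorov-Smirnov statistic per feature (in practice `scipy.stats.ks_2samp` would be used; the numpy-only sketch below just illustrates the idea on synthetic data):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two
    empirical CDFs, evaluated over all observed values."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(size=1000), rng.normal(size=1000))
shifted = ks_statistic(rng.normal(size=1000), rng.normal(2.0, 1.0, size=1000))
print(same, shifted)  # a shifted distribution yields a much larger statistic
```

Applied per numeric feature to `train_df` and `test_df`, a large statistic would flag the mismatched features objectively rather than by eye.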
Let's check the distribution of the mean values (numerical) per row in the train and test set.
plt.figure(figsize=(16,6))
features = train_df.columns.values[:-1]
plt.title("Distribution of mean values per row in the train and test set")
sns.distplot(train_df[numer_feat].mean(axis=1),color="green", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].mean(axis=1),color="blue", kde=True,bins=50, label='test')
plt.legend()
plt.show()
Let's check the distribution of the mean values per columns in the train and test set.
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per column in the train and test set")
sns.distplot(train_df[numer_feat].mean(axis=0),color="magenta",kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].mean(axis=0),color="darkblue", kde=True,bins=50, label='test')
plt.legend()
plt.show()
Let's show the distribution of standard deviation of values per row for train and test datasets.
plt.figure(figsize=(16,6))
plt.title("Distribution of std values per row in the train and test set")
sns.distplot(train_df[numer_feat].std(axis=1),color="black", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].std(axis=1),color="red", kde=True,bins=50, label='test')
plt.legend();plt.show()
Let's check the distribution of the standard deviation of values per columns in the train and test datasets.
plt.figure(figsize=(16,6))
plt.title("Distribution of std values per column in the train and test set")
sns.distplot(train_df[numer_feat].std(axis=0),color="blue",kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].std(axis=0),color="green", kde=True,bins=50, label='test')
plt.legend(); plt.show()
Let's check now the distribution of the mean value per row in the train dataset, grouped by value of target.
t0 = train_df.loc[train_df[target] == 0]
t1 = train_df.loc[train_df[target] == 1]
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per row in the train set")
sns.distplot(t0[numer_feat].mean(axis=1),color="red", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].mean(axis=1),color="blue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's check now the distribution of the mean value per column in the train dataset, grouped by value of target.
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per column in the train set")
sns.distplot(t0[numer_feat].mean(axis=0),color="green", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].mean(axis=0),color="darkblue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's check the distribution of min per row in the train and test set.
plt.figure(figsize=(16,6))
plt.title("Distribution of min values per row in the train and test set")
sns.distplot(train_df[numer_feat].min(axis=1),color="red", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].min(axis=1),color="orange", kde=True,bins=50, label='test')
plt.legend()
plt.show()
A wide variation, concentrated around a few clusters, is observed.
Let's now show the distribution of min per column in the train and test set.
plt.figure(figsize=(16,6))
plt.title("Distribution of min values per column in the train and test set")
sns.distplot(train_df[numer_feat].min(axis=0),color="magenta", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].min(axis=0),color="darkblue", kde=True,bins=50, label='test')
plt.legend()
plt.show()
Yet again, one column (or very few) has very large values compared to the rest (the majority) of the columns.
Let's check now the distribution of max values per rows for train and test set.
plt.figure(figsize=(16,6))
plt.title("Distribution of max values per row in the train and test set")
sns.distplot(train_df[numer_feat].max(axis=1),color="brown", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].max(axis=1),color="yellow", kde=True,bins=50, label='test')
plt.legend()
plt.show()
Let's show now the max distribution on columns for train and test set.
plt.figure(figsize=(16,6))
plt.title("Distribution of max values per column in the train and test set")
sns.distplot(train_df[numer_feat].max(axis=0),color="blue", kde=True,bins=50, label='train')
sns.distplot(test_df[numer_feat].max(axis=0),color="red", kde=True,bins=50, label='test')
plt.legend()
plt.show()
Let's show now the distributions of min values per row in train set, separated on the values of target (0 and 1).
t0 = train_df.loc[train_df[target] == 0]
t1 = train_df.loc[train_df[target] == 1]
plt.figure(figsize=(16,6))
plt.title("Distribution of min values per row in the train set")
sns.distplot(t0[numer_feat].min(axis=1),color="orange", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].min(axis=1),color="darkblue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
We show here the distribution of min values per columns in train set.
plt.figure(figsize=(16,6))
plt.title("Distribution of min values per column in the train set")
sns.distplot(t0[numer_feat].min(axis=0),color="red", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].min(axis=0),color="blue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's show now the distribution of max values per row in the train set.
plt.figure(figsize=(16,6))
plt.title("Distribution of max values per row in the train set")
sns.distplot(t0[numer_feat].max(axis=1),color="gold", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].max(axis=1),color="darkblue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's show also the distribution of max values per columns in the train set.
plt.figure(figsize=(16,6))
plt.title("Distribution of max values per column in the train set")
sns.distplot(t0[numer_feat].max(axis=0),color="red", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].max(axis=0),color="blue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's now look at the distribution of skewness per row and per column.
First, the distribution of skewness calculated per row in the train and test sets.
plt.figure(figsize=(16,6))
plt.title("Distribution of skew per row in the train and test set")
sns.distplot(train_df[numer_feat].skew(axis=1),color="red", kde=True,bins=20, label='train')
sns.distplot(test_df[numer_feat].skew(axis=1),color="orange", kde=True,bins=20, label='test')
plt.legend()
plt.show()
Next, the distribution of skewness calculated per column in the train and test sets.
plt.figure(figsize=(16,6))
plt.title("Distribution of skew per column in the train and test set")
sns.distplot(train_df[numer_feat].skew(axis=0),color="magenta", kde=True,bins=30, label='train')
sns.distplot(test_df[numer_feat].skew(axis=0),color="darkblue", kde=True,bins=30, label='test')
plt.legend()
plt.show()
Let's now look at the distribution of kurtosis per row and per column.
First, the distribution of kurtosis calculated per row in the train and test sets.
plt.figure(figsize=(16,6))
plt.title("Distribution of kurtosis per row in the train and test set")
sns.distplot(train_df[numer_feat].kurtosis(axis=1),color="darkblue", kde=True,bins=20, label='train')
sns.distplot(test_df[numer_feat].kurtosis(axis=1),color="yellow", kde=True,bins=20, label='test')
plt.legend()
plt.show()
Next, the distribution of kurtosis calculated per column in the train and test sets.
plt.figure(figsize=(16,6))
plt.title("Distribution of kurtosis per column in the train and test set")
sns.distplot(train_df[numer_feat].kurtosis(axis=0),color="magenta", kde=True,bins=40, label='train')
sns.distplot(test_df[numer_feat].kurtosis(axis=0),color="green", kde=True,bins=40, label='test')
plt.legend()
plt.show()
Let's now see the distribution of skewness per row in the train set, separated by target value (0 and 1).
t0 = train_df.loc[train_df[target] == 0]
t1 = train_df.loc[train_df[target] == 1]
plt.figure(figsize=(16,6))
plt.title("Distribution of skew values per row in the train set")
sns.distplot(t0[numer_feat].skew(axis=1),color="red", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].skew(axis=1),color="blue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's now see the distribution of skewness per column in the train set, separated by target value.
plt.figure(figsize=(16,6))
plt.title("Distribution of skew values per column in the train set")
sns.distplot(t0[numer_feat].skew(axis=0),color="red", kde=True,bins=30, label='target = 0')
sns.distplot(t1[numer_feat].skew(axis=0),color="blue", kde=True,bins=30, label='target = 1')
plt.legend(); plt.show()
Let's now see the distribution of kurtosis per row in the train set, separated by target value.
plt.figure(figsize=(16,6))
plt.title("Distribution of kurtosis values per row in the train set")
sns.distplot(t0[numer_feat].kurtosis(axis=1),color="red", kde=True,bins=50, label='target = 0')
sns.distplot(t1[numer_feat].kurtosis(axis=1),color="blue", kde=True,bins=50, label='target = 1')
plt.legend(); plt.show()
Let's now see the distribution of kurtosis per column in the train set, separated by target value.
plt.figure(figsize=(16,6))
plt.title("Distribution of kurtosis values per column in the train set")
sns.distplot(t0[numer_feat].kurtosis(axis=0),color="red", kde=True,bins=10, label='target = 0')
sns.distplot(t1[numer_feat].kurtosis(axis=0),color="blue", kde=True,bins=10, label='target = 1')
plt.legend(); plt.show()
We now calculate the correlations between the numerical features in the train set.
The following table shows the 10 least correlated feature pairs.
%%time
correlations = train_df[numer_feat].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
correlations = correlations[correlations['level_0'] != correlations['level_1']]
correlations.head(10)
Let's look at the most correlated feature pairs, excluding self-pairs.
correlations.tail(10)
Let's look at the most correlated feature pairs, excluding self-pairs, this time including the one-hot encoded categorical features.
%%time
dfs_for_concat = [train_df.drop(columns=categ_feat, axis=1)]
dfs_for_concat = dfs_for_concat + categ_feat_dfs
train_df_total = pd.concat(dfs_for_concat, axis=1, join='outer')
train_df_total = train_df_total.drop(columns=['y'], axis=1)
train_df_total['y'] = train_df['y']
correlations2 = train_df_total.drop(columns=['y']).corr().abs().unstack().sort_values(kind="quicksort").reset_index()
correlations2 = correlations2[correlations2['level_0'] != correlations2['level_1']]
correlations2.head(10)
And the 10 most correlated feature pairs, including the categorical features.
correlations2.tail(10)
# creating test_df_total
dfs_for_concat = [test_df.drop(columns=categ_feat, axis=1)]
dfs_for_concat = dfs_for_concat + categ_feat_test_dfs
test_df_total = pd.concat(dfs_for_concat, axis=1, join='outer')
test_df_total = test_df_total.drop(columns=['y'], axis=1)
test_df_total['y'] = test_df['y']
Visualise correlations on the train dataset:
corr_heatmap_plot(train_df, cmap='coolwarm')
Visualise correlations on the test dataset:
corr_heatmap_plot(test_df, cmap='coolwarm')
The correlation between some of the features is very high. Age, notably, is not strongly correlated with the other variables.
Let's now check how many duplicate values exist per column.
%%time
unique_max_train = []
unique_max_test = []
for feature in numer_feat:
    values = train_df[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test_df[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])
Let's show the top 15 features by maximum duplicate count in the train set.
np.transpose((pd.DataFrame(unique_max_train, columns=['Feature', 'Max duplicates', 'Value'])).\
sort_values(by = 'Max duplicates', ascending=False).head(15))
Let's also see the top 15 features by maximum duplicate count in the test set.
np.transpose((pd.DataFrame(unique_max_test, columns=['Feature', 'Max duplicates', 'Value'])).\
sort_values(by = 'Max duplicates', ascending=False).head(15))
The same columns in the train and test set have the same, or very similar, numbers of duplicates of the same, or very close, values. This suggests that our test sample is of similar quality to the training sample.
Let's start by calculating a few aggregated values from the existing features.
%%time
idx = numer_feat
for dfn in [test_df_total, train_df_total]:
    dfn['sum'] = dfn[idx].sum(axis=1)
    dfn['min'] = dfn[idx].min(axis=1)
    dfn['max'] = dfn[idx].max(axis=1)
    dfn['mean'] = dfn[idx].mean(axis=1)
    dfn['std'] = dfn[idx].std(axis=1)
    dfn['skew'] = dfn[idx].skew(axis=1)
    dfn['kurt'] = dfn[idx].kurtosis(axis=1)
    dfn['med'] = dfn[idx].median(axis=1)
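The aggregation loop above can be stated more compactly by mapping the new column names to pandas reduction methods; a self-contained sketch on a toy frame (same eight statistics):

```python
import pandas as pd

# name -> pandas method, mirroring the eight row-wise aggregates above
agg_funcs = {'sum': 'sum', 'min': 'min', 'max': 'max', 'mean': 'mean',
             'std': 'std', 'skew': 'skew', 'kurt': 'kurtosis', 'med': 'median'}

demo = pd.DataFrame({'a': [1.0, 4.0], 'b': [3.0, 2.0], 'c': [5.0, 6.0]})
base_cols = ['a', 'b', 'c']  # stands in for numer_feat
for name, method in agg_funcs.items():
    demo[name] = getattr(demo[base_cols], method)(axis=1)
print(demo.loc[0, ['sum', 'mean', 'med']].tolist())  # [9.0, 3.0, 3.0]
```

Adding or removing a statistic then means editing one dictionary entry rather than one assignment line per dataset.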
Let's check the newly created features derived from the numeric values.
train_df_total[train_df_total.columns[-8:]].head()
test_df_total[test_df_total.columns[-8:]].head()
# move the target column to the end
train_df_total = train_df_total.drop(columns=['y'], axis=1)
train_df_total['y'] = train_df['y']
test_df_total = test_df_total.drop(columns=['y'], axis=1)
test_df_total['y'] = test_df['y']
Let's check the distribution of these new, engineered features.
We plot first the distribution of new features, grouped by value of corresponding target values.
t0 = train_df_total.loc[train_df_total[target] == 0]
t1 = train_df_total.loc[train_df_total[target] == 1]
features = train_df_total.columns[-9:-1] # 8 new features
plot_new_feature_distribution(t0, t1, 'target: 0', 'target: 1', features)
Let's show the distribution of new features values for train and test.
plot_new_feature_distribution(train_df_total, test_df_total, 'train', 'test', features)
We can see that train_df_total and test_df_total show practically no difference in the distributions of the new features, which is good, while some features, like min, sum and median, differ considerably between the two classes.
Let's check how many features we have now (should be 53 features + 8 engineered + 1 target).
print('Train and test columns: {} {}'.format(len(train_df_total.columns), len(test_df_total.columns)))
features = [c for c in train_df_total.columns if c != target]
target = train_df_total[target]  # note: rebinds `target` from the column name to the target Series
We define the hyperparameter grid for the model.
ratio = len(t0)/len(t1) # ratio between the two imbalanced classes
parameters = {
'booster': ['gbtree'],
'verbosity': [0],
'max_depth': [3, 5, 7,],
'min_child_weight': [3, 5, 7],
'gamma': [0.0, 0.15, 0.3],
'subsample': [0.5, 1.0],
'objective': ['binary:logistic'],
'disable_default_eval_metric': [1],
'eval_metric': ['auc'],
'n_estimators': [1000],
'scale_pos_weight': [ratio],
'n_jobs': [8]
}
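It is worth estimating the search cost before launching it; a small sketch counting the fits implied by the grid (single-valued entries such as booster and objective multiply by 1 and are omitted):

```python
from math import prod

# Combination count for the multi-valued entries of the grid above;
# the candidate lists mirror the parameters dictionary.
grid = {'max_depth': [3, 5, 7], 'min_child_weight': [3, 5, 7],
        'gamma': [0.0, 0.15, 0.3], 'subsample': [0.5, 1.0]}
n_combos = prod(len(v) for v in grid.values())
cv_folds = 10
print(n_combos, n_combos * cv_folds)  # 54 combinations -> 540 fits with 10-fold CV
```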
We now run GridSearchCV to identify the best hyperparameters.
grid_search = GridSearchCV(estimator=xgb.XGBClassifier(), param_grid=parameters, cv=10)
%%time
# perform grid search
grid_search.fit(train_df_total[features], target)
# get best parameters
best_estimator = grid_search.best_estimator_
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
# replace parameters
# best_parameters['verbosity'] = 1
best_parameters['verbose_eval'] = 100
print("Best Estimator:", best_estimator)
print("Best Accuracy:", best_accuracy)
print("Best Parameters:", best_parameters)
print("Balance Ratio:", round(ratio, 3))
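The `scale_pos_weight` entry above is simply the negative-to-positive count ratio, so the rare positive class is up-weighted in the loss. A toy sketch with made-up counts (not the actual train_df figures, which come from `len(t0)/len(t1)`):

```python
# Illustrative class counts only; the notebook's real ratio is len(t0)/len(t1).
n_neg, n_pos = 29000, 3950
ratio = n_neg / n_pos
print(round(ratio, 3))  # each positive example weighs roughly 7.3 negatives
```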
We train the first model with the best cross-validated hyperparameters and evaluate its performance on the test set.
%%time
clf1 = xgb.XGBClassifier(**best_parameters)
tr_val = (train_df_total[features], target)
ts_val = (test_df_total[features], test_df_total['y'])
clf1.fit(train_df_total[features], target, eval_metric='auc', eval_set=[tr_val, ts_val])
true_flag = test_df_total['y']
pred_flag = clf1.predict(test_df_total[features])
print("AUC score: {:<8.3f}".format(roc_auc_score(true_flag, pred_flag)))
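Note that `roc_auc_score` on hard 0/1 predictions collapses the ROC curve to a single point; passing the positive-class probability, `clf1.predict_proba(...)[:, 1]`, would give the usual ranking-based AUC. The numpy-only sketch below illustrates the difference via the Mann-Whitney formulation (toy labels and scores, not the notebook's data):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney U formulation); ties count as half a win."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
probs = [0.1, 0.6, 0.4, 0.8]                  # continuous scores keep the ranking
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholding discards it
print(auc_score(y, probs), auc_score(y, hard))  # 0.75 0.5
```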
Check confusion matrix, AUC plot and other metrics
metrics = confusion_mat_plot(true_flag, pred_flag)
gini = auc_plot(test_df_total[features], true_flag, clf1)
# record metrics
model_comparison = pd.DataFrame()
model_comparison['With Eng Feat'] = [gini, metrics[0], metrics[1], metrics[2], metrics[3], metrics[4]]
model_comparison.index = ['Gini', 'Accuracy', 'Precision', 'Recall', 'Specificity', 'F-Score']
Let's check the feature importance.
feature_importance_df = pd.DataFrame()
feature_importance_df["Feature"] = features
feature_importance_df["importance"] = clf1.feature_importances_
# plot results
plt.figure(figsize=(12,12))
sns.barplot(x="importance", y="Feature", data=feature_importance_df.sort_values(by="importance",ascending=False))
plt.title('Features importance')
plt.tight_layout()
plt.show()
It seems that most of the features we engineered are among the most important ones for deciding the splits.
So, as the next steps, we will build three more models with the same training hyperparameters and compare their performance:
One with the top 5 features from the previous model.
One without the engineered features, to make the model more explainable and to check how much performance we sacrifice.
One with the top 5 features from that simpler model.
%%time
clf2 = xgb.XGBClassifier(**best_parameters)
top_5_feat = feature_importance_df.sort_values(by="importance",ascending=False)['Feature'].values[:5]
tr_val = (train_df_total[top_5_feat], target)
ts_val = (test_df_total[top_5_feat], test_df_total['y'])
clf2.fit(train_df_total[top_5_feat], target, eval_metric='auc', eval_set=[tr_val, ts_val])
true_flag = test_df_total['y']
pred_flag = clf2.predict(test_df_total[top_5_feat])
print("AUC score: {:<8.3f}".format(roc_auc_score(true_flag, pred_flag)))
Check confusion matrix, AUC plot and other metrics
metrics = confusion_mat_plot(true_flag, pred_flag)
gini = auc_plot(test_df_total[top_5_feat], true_flag, clf2)
# record new metrics
model_comparison['With Eng Feat-Top 5'] = [gini, metrics[0], metrics[1], metrics[2], metrics[3], metrics[4]]
Here we can clearly see that the simplified model loses a little performance, but in exchange it is less complicated.
Let's check again the potential changes in feature importance (with engineered features).
feature_importance_df2 = pd.DataFrame()
feature_importance_df2["Feature"] = top_5_feat
feature_importance_df2["importance"] = clf2.feature_importances_
# plot results
plt.figure(figsize=(12,12))
sns.barplot(x="importance", y="Feature", data=feature_importance_df2.sort_values(by="importance",ascending=False))
plt.title('Features importance')
plt.tight_layout()
plt.show()
We will now try to use only the original processed data, without the engineered features.
%%time
clf3 = xgb.XGBClassifier(**best_parameters)
tr_val = (train_df_total[train_df_total.columns[:-9]], target)
ts_val = (test_df_total[train_df_total.columns[:-9]], test_df_total['y'])
clf3.fit(train_df_total[train_df_total.columns[:-9]], target, eval_metric='auc', eval_set=[tr_val, ts_val])
true_flag = test_df_total['y']
pred_flag = clf3.predict(test_df_total[train_df_total.columns[:-9]])
print("AUC score: {:<8.3f}".format(roc_auc_score(true_flag, pred_flag)))
Check confusion matrix, AUC plot and other metrics
metrics = confusion_mat_plot(true_flag, pred_flag)
gini = auc_plot(test_df_total[train_df_total.columns[:-9]], true_flag, clf3)
# record new metrics
model_comparison['Without Eng Feat'] = [gini, metrics[0], metrics[1], metrics[2], metrics[3], metrics[4]]
Let's check the new feature importance (without engineered features).
feature_importance_df3 = pd.DataFrame()
feature_importance_df3["Feature"] = train_df_total.columns.to_list()[:-9]
feature_importance_df3["importance"] = clf3.feature_importances_
# plot results
plt.figure(figsize=(12,12))
sns.barplot(x="importance", y="Feature", data=feature_importance_df3.sort_values(by="importance",ascending=False))
plt.title('Features importance')
plt.tight_layout()
plt.show()
%%time
clf4 = xgb.XGBClassifier(**best_parameters)
top_5_feat_2 = feature_importance_df3.sort_values(by="importance",ascending=False)['Feature'].values[:5]
tr_val = (train_df_total[top_5_feat_2], target)
ts_val = (test_df_total[top_5_feat_2], test_df_total['y'])
clf4.fit(train_df_total[top_5_feat_2], target, eval_metric='auc', eval_set=[tr_val, ts_val])
true_flag = test_df_total['y']
pred_flag = clf4.predict(test_df_total[top_5_feat_2])
print("AUC score: {:<8.3f}".format(roc_auc_score(true_flag, pred_flag)))
Check confusion matrix, AUC plot and other metrics
metrics = confusion_mat_plot(true_flag, pred_flag)
gini = auc_plot(test_df_total[top_5_feat_2], true_flag, clf4)
# record new metrics
model_comparison['Without Eng Feat-Top 5'] = [gini, metrics[0], metrics[1], metrics[2], metrics[3], metrics[4]]
We again observe a small sacrifice in performance; however, we get a less complicated model.
Let's check again the potential changes in feature importance (without engineered features).
feature_importance_df4 = pd.DataFrame()
feature_importance_df4["Feature"] = top_5_feat_2
feature_importance_df4["importance"] = clf4.feature_importances_
# plot results
plt.figure(figsize=(12,12))
sns.barplot(x="importance", y="Feature", data=feature_importance_df4.sort_values(by="importance",ascending=False))
plt.title('Features importance')
plt.tight_layout()
plt.show()
As we can see, the most predictive features driving the model's decisions are nr.employed and the outcome of the previous campaign, i.e. the feature poutcome_success.
There is a lot more analysis that could be done to find the best-tuned model for this problem. Other areas we could explore to improve the model are:
Below we can see the cumulative results of all models:
%%time
# create all model arguments
X_test = [test_df_total[features], test_df_total[top_5_feat],
test_df_total[train_df_total.columns[:-9]], test_df_total[top_5_feat_2]]
y_test = [true_flag, true_flag, true_flag, true_flag]
clf = [clf1, clf2, clf3, clf4]
model_names = model_comparison.columns.to_list()
model_weights = [None, None, None, None]
%%time
auc_plot_all(X_test, y_test, clf, model_names, w=model_weights)
%%time
plot_models_radar(model_comparison)
%%time
metrics_slider_plot(X_test, y_test, clf, model_names)